{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 中文分词技术"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 设置 Jupyter Notebook\n",
    "\n",
    "设置 Jupyter Notebook 的输出，只打印有用的信息，忽略没用的 warning 信息。\n",
    "\n",
    "如果使用 Jupyter Notebook, 请查看[Using Jupyter Notebook](https://rasa.com/docs/rasa/api/jupyter-notebooks/)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Logging settings\n",
    "import logging, io, json, warnings, pprint\n",
    "logging.basicConfig(level=\"ERROR\")\n",
    "warnings.filterwarnings('ignore')\n",
    "pp = pprint.PrettyPrinter(indent=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 制作中文语料\n",
    "\n",
    "Rasa NLU的实体识别和意图识别的任务，需要一个训练好的MITIE的模型。这个MITIE模型是非监督训练得到的，类似于word2vec中的word embedding。\n",
    "\n",
    "要训练这个MITIE模型，我们需要一个规模比较大的中文语料。最好的方法是用对应自己需求的语料，比如做金融的chatbot就多去爬取些财经新闻，做医疗的chatbot就多获取些医疗相关文章。\n",
    "\n",
    "我使用的是 [awesome-chinese-nlp](https://github.com/crownpku/awesome-chinese-nlp) 中列出的中文wikipedia dump和百度百科语料。其中关于wikipedia dump的处理可以参考[这篇帖子](http://blog.csdn.net/qq_32166627/article/details/68942216)。\n",
    "\n",
    "仅仅获取语料还不够，因为MITIE模型训练的输入是以词为单位的。所以要先进行分词，我们使用结巴分词。\n",
    "\n",
    "[这篇文章](http://www.crownpku.com/2017/07/27/%E7%94%A8Rasa_NLU%E6%9E%84%E5%BB%BA%E8%87%AA%E5%B7%B1%E7%9A%84%E4%B8%AD%E6%96%87NLU%E7%B3%BB%E7%BB%9F.html)用中文wikipedia和百度百科语料生成了一个total_word_feature_extractor_chi.dat，分享如下：\n",
    "\n",
    "```\n",
    "链接：https://pan.baidu.com/s/1kNENvlHLYWZIddmtWJ7Pdg 密码：p4vx\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MITIE模型训练\n",
    "\n",
    "我们把所有分好词的语料文件放在同一个文件路径下。接下来我们要训练MITIE模型。\n",
    "\n",
    "首先将MITIE clone下来：\n",
    "\n",
    "```\n",
    "$ git clone https://github.com/mit-nlp/MITIE.git\n",
    "```\n",
    "\n",
    "我们要使用的只是MITIE其中wordrep这一个工具。我们先build它。\n",
    "```\n",
    "$ cd MITIE/tools/wordrep\n",
    "$ mkdir build\n",
    "$ cd build\n",
    "$ cmake ..\n",
    "$ cmake --build . --config Release\n",
    "```\n",
    "\n",
    "然后训练模型，得到total_word_feature_extractor.dat。注意这一步训练会耗费几十GB的内存，大概需要两到三天的时间。。。\n",
    "\n",
    "```\n",
    "$ ./wordrep -e /path/to/your/folder_of_cutted_text_files\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 中文分词\n",
    "中文自动分词目前主要可以归纳为 “规则分词”，“统计分词”，“混合分词(规则 + 统计)”。\n",
    "\n",
    "规则分词就是通过人工设立词库，然后按照一定的方法进行切分，简单高效。但这种方法对于新词的处理非常麻烦。\n",
    "\n",
    "统计分词就是把每个词看做是由词的最小单位的各个字组成的，如果相连的字在不同的文本中出现的此处越多，就证明这相连的字很可能就是一个词。这种方法的缺点是极大得依赖语料库。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Jieba 分词\n",
    "\n",
    "中文由于没有空格，所以需要先做分词，此处我们使用 `jieba` 分词, 以下是 `jieba` 分词的例子. Jieba 提供了三种分词模式:\n",
    "\n",
    "* 精确模式: 试图将句子最精确地其可爱，适合文本分析\n",
    "* 全模式: 把句子中所有可以成词的词语都扫描出来，速度非常快，但是不能解决歧义\n",
    "* 搜索匹配模式: 在精确模式的基础上，对长词再次切分，提高召回率，适合用于搜索引擎分词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba\n",
    "jieba.setLogLevel(logging.INFO) # define the jieba log level\n",
    "\n",
    "sent = \"中文分词是文本处理不可或缺的一步!\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "seg_list = jieba.cut(sent, cut_all=True)\n",
    "print(\"全模式: \" + \",\".join(seg_list))  # 全模式\n",
    "\n",
    "seg_list = jieba.cut(sent, cut_all=False)\n",
    "print(\"精确模式: \" + \",\".join(seg_list))  # 精确模式（默认）\n",
    "\n",
    "seg_list = jieba.cut_for_search(sent)\n",
    "print(\"搜索引擎模式: \" + \",\".join(seg_list))  # 搜索引擎模式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 中文词性标注\n",
    "\n",
    "每一个词性都有标准的标注规范，请参考 [词性标注规范表](https://blog.csdn.net/jdjh1024/article/details/81318635).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba.posseg as psg\n",
    "\n",
    "sent = \"中文分词是文本处理不可或缺的一步！\"\n",
    "\n",
    "seg_list = psg.cut(sent)\n",
    "\n",
    "print(' '.join(['{0}/{1}'.format(w, t) for w, t in seg_list]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
